ABSTRACT



The present invention provides for the detection of human 
heads, faces and eyes in real-time and in uncontrolled 
environments. The present invention may be implemented 
with commercially available components, such as a standard 
video camera and a frame grabber, on a personal computer 
(PC) platform. The approach used by the present invention 
is based on a probabilistic framework that uses a deformable 
template model to describe the human face. The present 
invention works both with simple head-and-shoulder video 
sequences, as well as with complex video scenes with 
multiple people and random motion. The present invention 
is able to locate the eyes from different head poses (rotations 
in image plane as well as in depth). The information 
provided by the location of the eyes may be used to extract 
faces from a frontal pose in a video sequence. The extracted 
frontal frames can be passed to recognition and classification 
systems (or the like) for further processing. 

8 Claims, 12 Drawing Sheets 
SYSTEM AND METHOD FOR DETECTING A HUMAN FACE IN UNCONTROLLED ENVIRONMENTS

CROSS-REFERENCE TO RELATED PATENT APPLICATIONS

This patent application is a continuation-in-part of co-pending U.S. Provisional Patent Application Ser. No. 60/031,816, entitled "Real-Time Detection of Human Faces in Uncontrolled Environments", filed Nov. 26, 1996.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention generally relates to real-time video image analysis, and more specifically to the detection of human faces and eyes within real-time video images.

2. Description of the Prior Art

In recent years, the detection of human faces from video data has become a popular research topic. There are numerous commercial applications of face detection, such as in face recognition, verification, classification, identification as well as security access and multimedia. To extract the human face in an uncontrolled environment, most prior art techniques attempt to overcome the difficulty of dealing with issues such as variations in lighting, variations in pose, occlusion of people by other people, and cluttered or non-uniform backgrounds.

In one prior art face detection technique, an example-based learning approach for locating unoccluded human frontal faces is used. The approach measures a distance between the local image and a few view-based "face" and "non face" pattern prototypes at each image location to locate the face. In another technique, the distance to a "face space", defined by "eigenfaces", is used to locate and track frontal human faces. In yet another prior art technique, human faces are detected by searching for significant facial features at each location in the image. Finally, in other techniques, a deformable template based approach is used to detect faces and to extract facial features.

In addition to the detection of faces within video image sequences, prior art systems have attempted to detect eyes on human heads. For example, Chellappa et al., "Human and Machine Recognition of Faces: A Survey", Proceedings of the IEEE, vol. 83, no. 5, pp. 705-740, May 1995, described a process for detecting eyes on a human head, where the video image includes a front view of the head. For frontal views, eye detection that is based on geometrical measures has been extensively studied by, for example, Stringa, "Eyes Detection for Face Recognition", Applied Artificial Intelligence, vol. 7, no. 4, pp. 365-382, October-December 1993 and Brunelli et al., "Face Recognition: Features versus Templates", IEEE Transactions on Pattern Analysis and Machine Intelligence, October 1993. Additionally, Yuille et al., "Feature Extraction from Faces Using Deformable Templates", International Journal of Computer Vision, vol. 8, pp. 299-311, 1992, describe a deformable template-based approach to facial feature detection. However, these methods may lead to significant problems in the analysis of profile or back views. Moreover, the underlying assumption of dealing only with frontal faces is simply not valid for real-world applications.

There is therefore a significant need in the art for a system that can quickly, reliably and flexibly detect the existence of a face or faces within a video image, and that can also extract various features of each face, such as eyes.

SUMMARY OF THE INVENTION

The present invention provides for the detection of human heads, faces and eyes in real-time and in uncontrolled environments. The present invention may be implemented with commercially available components, such as a standard video camera and a frame grabber, on a personal computer (PC) platform. The approach used by the present invention is based on a probabilistic framework that uses a deformable template model to describe the human face. The present invention works both with simple head-and-shoulder video sequences, as well as with complex video scenes with multiple people and random motion. The present invention is able to locate the eyes from different head poses (rotations in image plane as well as in depth). The information provided by the location of the eyes may be used to extract faces from a frontal pose in a video sequence. The extracted frontal frames can be passed to recognition and classification systems (or the like) for further processing.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of the present invention.

FIG. 2 is a flow diagram depicting the overall operation of the present invention.

FIG. 3 is a flow diagram depicting a process for choosing the most likely model of people within the video image.

FIG. 4 is a flow diagram further depicting the modeling process of FIG. 3.

FIG. 5 is a flow diagram depicting a process for fitting an ellipse around the head of a person detected within a video image.

FIGS. 6A-6D, 7A-7C, 8A-8D and 9A-9D depict examples of video images that may be analyzed by the present invention.

FIG. 10 depicts criteria that may be used to model a face within a video image.

FIGS. 11-12 are flow diagrams depicting processes that are performed by the present invention.

DETAILED DESCRIPTION OF THE INVENTION

A preferred embodiment of the invention is now described in detail. Referring to the drawings, like reference numerals indicate like components and/or steps throughout the views.

1. The Video System

FIG. 1 depicts the overall structure of the present invention in one embodiment. The hardware components of the present invention may consist of standard off-the-shelf components. The primary components in the system are one or more video cameras 110, one or more frame grabbers 120, and a processing system 130, such as a personal computer (PC). The combination of the PC 130 and frame grabber 120 may collectively be referred to as a "video processor" 140. The video processor 140 receives a standard video signal format 115, such as RS-170, NTSC, CCIR, PAL, from one or more of the cameras 110, which can be monochrome or color. In a preferred embodiment, the camera(s) 110 may be mounted or positioned to view a selected area of interest, such as within a retail establishment or other suitable location.

The video signal 115 is input to the frame grabber 120. In one embodiment, the frame grabber 120 may comprise a Meteor Color Frame Grabber, available from Matrox. The frame grabber 120 operates to convert the analog video signal 115 into a digital image stored within the memory 135, which can be processed by the video processor 140. For example, in one implementation, the frame grabber 120 may convert the video signal 115 into a 640x480 (NTSC) or 768x576 (PAL) color image. The color image may consist of three color planes, commonly referred to as YUV or YIQ. Each pixel in a color plane may have 8 bits of resolution, which is sufficient for most purposes. Of course, a variety of other digital image formats and resolutions may be used as well, as will be recognized by one of ordinary skill.
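As a rough worked example of the storage these formats imply, the following Python sketch computes the size of one digitized frame. It assumes three full-resolution color planes at 8 bits per pixel, as described above; the function name `frame_bytes` is illustrative only, and an actual frame grabber may subsample the chrominance planes.

```python
def frame_bytes(width, height, planes=3, bits_per_pixel=8):
    """Bytes needed for one frame made of `planes` full-resolution planes."""
    return width * height * planes * bits_per_pixel // 8


ntsc_frame = frame_bytes(640, 480)  # 921,600 bytes per NTSC-sized frame
pal_frame = frame_bytes(768, 576)   # 1,327,104 bytes per PAL-sized frame
```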



As representations of the stream of digital images from the camera(s) 110 are sequentially stored in memory 135, analysis of the video image may begin. All analysis according to the teachings of the present invention may be performed by the processing system 130, but may also be performed by any other suitable means. Such analysis is described in further detail below.

2. Overall Process Performed by the Invention

An overall flow diagram depicting the process performed by the processing system 130 of the present invention is shown in FIG. 2. The first overall stage 201 performed by the processing system 130 is the detection of one or more human heads (or equivalent) within the video image from camera 110, which is stored in memory 135, and the second overall stage 202 is the detection of any eyes associated with the detected human head(s). The output 230 of stages 201-202 may be passed to recognition and classification systems (or the like) for further processing.

The steps performed in stage 201 are described below.

The first steps 212-213 and 216 (of the head detection stage 201) perform the segmentation of people in the foreground regions of the sequence of video images stored in memory 135 over time, which is represented in FIG. 2 as video sequence 211. Such segmentation is accomplished by background modeling (step 216), background subtraction and thresholding (step 212) and connected component analysis (step 213). Assuming the original image 600 of FIG. 6A (which may be stored in memory 135, etc.), as shown in FIG. 6B, the result of steps 212 and 213 is a set of connected regions (blobs) (e.g., blobs 601) which have large deviations from the background image. The connected components 601 are then filtered, also in step 213, to remove insignificant blobs due to shadow, noise and lighting variations, resulting in, for example, the blobs 602 in FIG. 6C.

To detect the heads of people whose bodies are occluded, a model-based approach is used (steps 214-215, 217). In this approach, different foreground models (step 217) may be used for the case where there is one person in a foreground region and the case where there are two people in a foreground region. The outputs of step 214 are the probabilities of the input given each of the foreground region models.

Step 215 selects the model that best describes the foreground region by selecting the maximum probability computed in step 214. An example output of step 215 is shown in FIG. 6D, wherein the ellipse 603 is generated.

The functionality performed by system 130 in steps 214-215 and 217 is illustrated in FIGS. 9A-9D. Each of FIGS. 9A-9D represents a video image that may be created by frame grabber 120 and stored in memory 135 (FIG. 1). FIG. 9A depicts an example foreground region representing one person 901. The one person model (x1, x2) matches the input data. FIG. 9B depicts the same foreground region modeled as two persons (x1, x2, x3). In this case two dashed ellipses 911, 912 are fitted but they do not represent the correct location of the head 913. The probability of the foreground region is computed for each model as is described later and the system automatically selects the model for one person to best describe the foreground region in this case.

FIGS. 9C and 9D depict an example foreground region with two people 902, 903 with occluded bodies. In this case, the system 130 of the present invention selects the two people model (x1, x2, x3) to best represent the data. When a single person model is used to describe the foreground region, the large dashed ellipse 921 is fitted, which does not correspond to the head of either person 902, 903. The system does not select the single person model because the probability of the one person model for the given input data is lower than the probability of the two person model given the input data.

The next overall stage 202 in the present invention is the detection of eyes from varying poses and the extraction of those faces that correspond to frontal views. In prior art articles, such as those described by Turk et al., "Face Recognition Using Eigenfaces", Proceedings on International Conference on Pattern Recognition, 1991 and Brunelli et al., "Face Recognition: Features versus Templates", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, no. 10, October 1993, techniques have been proposed whereby eyes are detected from frontal views. However, the assumption of frontal view faces is not valid for real world applications.

In the present invention, in steps 221-222 the most significant face features are detected by analyzing the connected regions of large deviations from facial statistics. Region size and anthropological measure-based filtering detect the eyes and the frontal faces. Eye detection based upon anthropological measures for frontal views has been studied in the prior art (see, e.g., Brunelli et al., cited previously). However, such methods can run into problems in the analysis of profile or back views of faces. In step 223, filtering based on detected region size is able to remove big connected components corresponding to hair as well as small regions generated by noise or shadow effects. In step 224, the remaining components are filtered considering the anthropological features of human eyes for frontal views, and again the output 230 may be passed to another system for further processing. The eye detection stage 202 of the present invention is described in further detail below.

3. Segmentation of Foreground Regions

To extract moving objects within the video image stored in memory 135, the background may be modeled as a texture with the intensity of each point modeled by a Gaussian distribution with mean $\mu$ and variance $\sigma$, $N_b(\mu,\sigma)$ (step 216). The pixels in the image are classified as foreground if $p(O(x,y) \mid N_b(\mu,\sigma)) \le T$ and as background if $p(O(x,y) \mid N_b(\mu,\sigma)) > T$. The observation $O(x,y)$ represents the intensity of the pixels at location $(x,y)$, and $T$ is a constant (step 212).

The connectivity analysis (step 213) of the "foreground" pixels generates connected sets of pixels, i.e. sets of pixels that are adjacent or touching. Each of the above sets of pixels describes a foreground region. Small foreground regions are assumed to be due to shadow, camera noise and lighting variations and are removed.

4. The Foreground Region Modeling System

The foreground regions are analyzed in further detail in steps 214-215 and 217 to detect the head. It is known that if there is only one head in the image, then it may be detected by finding the upper region in each set of connected foreground regions. However, this technique fails when people in an image are occluded by other people. In this case, a foreground region may correspond to two or more people, and finding the regions corresponding to heads requires a more complicated approach. In the case of partial people occlusion, in which bodies are occluded by other bodies, but heads are not occluded, special processing must be performed.
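The segmentation of Section 3 (Gaussian background model, thresholding, and connected component analysis with removal of small regions) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the patented implementation: images are nested lists of intensities, the per-pixel `mean` and `var` grids are assumed to have been estimated already, 8-connectivity is assumed for "adjacent or touching", and the names `foreground_mask`, `connected_regions` and `min_size` are illustrative.

```python
import math
from collections import deque


def gaussian_pdf(value, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((value - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def foreground_mask(image, mean, var, threshold):
    """Mark a pixel as foreground when its background likelihood is <= T."""
    return [[gaussian_pdf(image[y][x], mean[y][x], var[y][x]) <= threshold
             for x in range(len(image[0]))]
            for y in range(len(image))]


def connected_regions(mask, min_size):
    """Group foreground pixels into 8-connected regions; drop small ones."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    regions = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue, region = deque([(y, x)]), []
            while queue:  # breadth-first flood fill of one region
                cy, cx = queue.popleft()
                region.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if len(region) >= min_size:  # small blobs are treated as noise
                regions.append(region)
    return regions
```

In this sketch the size filter plays the role of the small-region removal in step 213; a real system would choose `threshold` and `min_size` empirically.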
To determine the head positions under such partial occlusion, the number of people in each foreground region must be determined. As shown in FIG. 3, in order to determine the number of people within the video image, N separate models $\lambda_i$ (301) (where i may equal 1 to N) may be built, each model $\lambda_i$ 301 corresponding to i people in a set of connected foreground regions. Based on the assumption that faces are vertical and are not occluded, the model parameters for model $\lambda_i$ are $(x_0, x_1, \ldots, x_i)$, where i is the number of people and $x_k$ (where k=1 to i) specifies the horizontal coordinates of the vertical boundaries that separate the i head regions in model $\lambda_i$. The approach used to determine the number of people in each foreground region is to select in step 215 the model $\lambda_i$ 301 for which the maximum likelihood is achieved:

$$\hat{\lambda} = \arg\max_{i \in [1, N]} P(O(x,y) \mid \lambda_i) \tag{1}$$

where the observations $O(x,y)$ are the pixel intensities at coordinates $(x,y)$ in the foreground regions and $P(O(x,y) \mid \lambda_i)$ is the likelihood function for the i-th model 301.
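Once a likelihood has been computed for each candidate model, the selection in Equation (1) is a plain arg max. A minimal Python sketch follows; the function name `select_model` and the likelihood values are hypothetical.

```python
def select_model(likelihoods):
    """Return (i, P) for the model with the maximum likelihood.

    `likelihoods` maps the number of people i to P(O | lambda_i).
    """
    best_i = max(likelihoods, key=likelihoods.get)
    return best_i, likelihoods[best_i]


# Hypothetical likelihoods for a foreground region that contains two people.
best = select_model({1: 0.71, 2: 0.93})
```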

The probability computation steps 302 in FIG. 3 determine the likelihood functions for each model 301. In step 215, the observations $O(x,y)$ in the foreground regions are used to find for each model $\lambda_i$ 301 the optimal set of parameters $(x_0, x_1, \ldots, x_i)$ that maximize $P(O(x,y) \mid \lambda_i)$, i.e. to find the parameters $(x_0, x_1, \ldots, x_i)$ that best segment the foreground regions (step 215). It will be shown later that the computation of $P(O(x,y) \mid \lambda_i)$ for each set of model parameters 301 requires an efficient head detection algorithm inside each rectangular window bordered by $x_{j-1}$ and $x_j$, $j = 1, \ldots, i$.

It is common to approximate the support of the human face by an ellipse. In addition, it has been determined that the ellipse aspect ratio of the human face is, for many situations, invariant to rotations in the image plane as well as rotations in depth. Based on the above, the head model 301 is parameterized by the set $(x_0, y_0, a, b)$, where $x_0$ and $y_0$ are the coordinates of the ellipse centroid and a and b are the axes of the ellipse. The set $(x_0, y_0, a, b)$ is determined through an efficient ellipse fitting process described elsewhere with respect to FIG. 5.

5. Computation of Foreground Model Likelihood Functions

Based on the assumption that human faces are vertical and are not occluded, it is deemed appropriate to parameterize models $\lambda_i$ 301 over the set of parameters $(x_0, x_1, \ldots, x_i)$, which are the horizontal coordinates of the vertical borders that separate individual faces in each foreground region. The set of parameters $(x_0, x_1, \ldots, x_i)$ is computed iteratively to maximize $P(O(x,y) \mid \lambda_i)$. In a Hidden Markov Model (HMM) implementation (described further in Rabiner et al., "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, February 1989), this corresponds to the training phase in which the model parameters are optimized to best describe the observed data.

To define the likelihood functions $P(O(x,y) \mid \lambda_i)$, a preliminary discussion about the head detection algorithm may be helpful. In the present invention, the head is determined by fitting an ellipse around the upper portions of the foreground regions inside each area bounded by $x_{j-1}, x_j$, $j = 1, \ldots, i$. The head detection problem is reduced to finding the set of parameters $(x_0, y_0, a, b)$ that describe an ellipse type deformable template (step 402 in FIG. 4). Parameters $x_0$ and $y_0$ describe the ellipse centroid coordinates and a and b are the ellipse axes. The ellipse fitting algorithm is described in more detail with respect to FIG. 5.
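The one-dimensional iteration that Section 6 describes for this ellipse fit (steps 1201-1205 of FIG. 12) can be previewed with a small Python sketch. It is a sketch under assumptions rather than the patented procedure: the blob is an ideal ellipse supplied through a hypothetical `half_width(y)` callback, y is taken to increase upward, the error is assumed to be b - M·a, and the update is assumed to be a linear step `y + step * error`. All names are illustrative.

```python
import math


def fit_centroid_y(half_width, y_top, y_start, ratio=1.4, step=1.0,
                   tol=1e-6, max_iter=100):
    """Move the centroid along the skeleton until b is close to ratio * a."""
    y = y_start
    for _ in range(max_iter):
        a = half_width(y)           # [1201] half of the blob width on line y
        b = y_top - y               # [1202] distance from centroid to blob top
        error = b - ratio * a       # [1203] deviation from b = M * a (assumed form)
        y_next = y + step * error   # [1204] linear update (assumed form)
        converged = abs(y_next - y) < tol
        y = y_next
        if converged:               # [1205] consecutive centroids are close
            break
    return y, half_width(y), y_top - y


# Synthetic head: ellipse centred at y = 50 with a = 10 and b = 1.4 * a = 14.
def ideal_half_width(y, y0=50.0, a=10.0, b=14.0):
    return a * math.sqrt(max(0.0, 1.0 - ((y - y0) / b) ** 2))


y_fit, a_fit, b_fit = fit_centroid_y(ideal_half_width, y_top=64.0, y_start=57.0)
```

On this synthetic blob the estimate settles near y = 50 within a handful of iterations; on real blobs the width would be measured from the extracted edges rather than from a formula.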



For each set of parameters $(x_0, y_0, a, b)$ a rectangular template (W in FIG. 10) is defined by the set of parameters $(x_0, y_0, \alpha a, \alpha b)$, where $x_0$ and $y_0$ are the coordinates of the center of the rectangle, $\alpha a$ and $\alpha b$ are the width and length of the rectangle, and $\alpha$ is some constant (see FIG. 10). In each area bounded by $x_{j-1}, x_j$, $R_{out,j}$ is the set of pixels outside the ellipse template and inside the rectangle template and $R_{in,j}$ is the set of pixels inside the ellipse template (FIG. 10). The regions $R_{in,j}$ and $R_{out,j}$ locally classify the image in "face" and "non face" regions. Based on the above discussion, the likelihood function $P(O(x,y) \mid \lambda_i)$ for the model $\lambda_i$ is determined by the ratio of the number of foreground pixels classified as "face" and background pixels classified as "non face" in each area bounded by $x_{j-1}, x_j$ (where j=1 to i), over the total number of pixels in "face" and "non face" regions (step 403). This is described in Equation (2) below:

$$P(O(x,y) \mid \lambda_i) = \frac{\sum_{j=1}^{i} \left( \sum_{(x,y) \in R_{in,j}} f(x,y) + \sum_{(x,y) \in R_{out,j}} b(x,y) \right)}{\sum_{j=1}^{i} \left( |R_{in,j}| + |R_{out,j}| \right)} \tag{2}$$

where

$$b(x,y) = \begin{cases} 1, & \text{if } p(O(x,y) \mid N_b(\mu,\sigma)) > T \\ 0, & \text{otherwise} \end{cases} \tag{3}$$

and

$$f(x,y) = \begin{cases} 1, & \text{if } p(O(x,y) \mid N_b(\mu,\sigma)) < T \\ 0, & \text{otherwise} \end{cases} \tag{4}$$
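The ratio described above can be sketched in Python as a pixel count over a foreground mask. The sketch assumes that a and b are semi-axes, that the rectangle W is centered on the ellipse with width $\alpha a$ and length $\alpha b$, and that the mask already encodes f(x,y) as True and b(x,y) as False; the names `slot_counts`, `model_likelihood` and the default `alpha=3.0` are illustrative, not taken from the patent.

```python
def in_ellipse(x, y, x0, y0, a, b):
    """True when (x, y) lies inside the ellipse template."""
    return ((x - x0) / a) ** 2 + ((y - y0) / b) ** 2 <= 1.0


def slot_counts(fg, x_left, x_right, ellipse, alpha):
    """Count correctly classified and total template pixels in one slot."""
    x0, y0, a, b = ellipse
    half_w, half_h = alpha * a / 2.0, alpha * b / 2.0
    correct = total = 0
    for y in range(len(fg)):
        for x in range(max(0, x_left), min(len(fg[0]), x_right)):
            if abs(x - x0) > half_w or abs(y - y0) > half_h:
                continue                      # outside rectangle W: ignored
            if in_ellipse(x, y, x0, y0, a, b):
                correct += int(fg[y][x])      # "face" region wants foreground
            else:
                correct += int(not fg[y][x])  # "non face" region wants background
            total += 1
    return correct, total


def model_likelihood(fg, borders, ellipses, alpha=3.0):
    """Ratio of correctly classified pixels over all template pixels."""
    correct = total = 0
    for j in range(1, len(borders)):
        c, t = slot_counts(fg, borders[j - 1], borders[j], ellipses[j - 1], alpha)
        correct, total = correct + c, total + t
    return correct / total if total else 0.0
```

A mask that matches the ellipse exactly scores 1.0, while a mask that is foreground everywhere scores lower because every "non face" pixel inside W is misclassified.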



The goal in steps 301-302 is not only to compute the likelihood functions $P(O(x,y) \mid \lambda_i)$ for a set of parameters $(x_0, x_1, \ldots, x_i)$, but also to determine the set of parameters that maximize $P(O(x,y) \mid \lambda_i)$. The initial parameters $(x_0, x_1, \ldots, x_i)$ for model $\lambda_i$ 301 are chosen to uniformly segment the data, $x_j - x_{j-1} = (x_i - x_0)/i$ (where j=1 to i). As described in FIG. 4, the parameters $(x_0, x_1, \ldots, x_i)$ are iteratively adjusted to maximize $P(O(x,y) \mid \lambda_i)$ (step 404). The iterations are terminated if the difference of the likelihood functions in two consecutive iterations is smaller than a threshold (step 405).
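The uniform initialization and the stopping rule can be sketched as follows. This is a hedged Python illustration: `likelihood` and `reestimate` are hypothetical callbacks standing in for Equation (2) and for whatever border update is used, and `tol` and `max_iter` are assumed values.

```python
def initial_borders(x_first, x_last, n_people):
    """Uniform segmentation: x_j - x_(j-1) = (x_i - x_0) / i."""
    slot = (x_last - x_first) / n_people
    return [x_first + j * slot for j in range(n_people + 1)]


def fit_borders(likelihood, reestimate, borders, tol=1e-3, max_iter=50):
    """Adjust the borders until the likelihood changes by less than tol."""
    current = likelihood(borders)
    for _ in range(max_iter):
        candidate = reestimate(borders)            # step 404: adjust borders
        updated = likelihood(candidate)
        change = abs(updated - current)
        borders, current = candidate, updated
        if change < tol:                           # step 405: stop when stable
            break
    return borders, current
```

For example, `initial_borders(0, 90, 3)` yields three slots of equal width, which is the starting point that the iteration then refines.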

In a two person model in one embodiment, $x_1$ is the only parameter that is iteratively adjusted for the estimation of the model. The computation of the likelihood function for a two person model is described in the following steps. The reference numerals within [brackets] correspond to like-numbered steps illustrated in FIG. 11:

[1101] the initial values of $(x_0, x_1, x_2)$ are determined such that $x_0$ is the x coordinate of the leftmost point of the foreground region, $x_2$ is the x coordinate of the rightmost point of the foreground region and $x_1 = (x_0 + x_2)/2$.

[1102] the ellipse fitting process (step 402 in FIG. 4) is performed in each of the two vertical slots bounded by the $(x_0, x_1)$ and $(x_1, x_2)$ pairs. The ellipse fitting algorithm will be described in more detail later with respect to FIG. 5.

[1103] for the ellipses found, the following parameters are computed (step 403 in FIG. 4):
[Equations (4A) and (4B) are illegible in the source text.]

[1104] reestimate the value of $x_1$ according to the following formula:

[Equation (4C) is illegible in the source text.]

where $\mu$ is a constant around 20.

[1105] compute $P(O \mid \lambda_2)$ from Equation (2). If the difference between $P(O \mid \lambda_2)$ for consecutive values of parameter $x_1$ is smaller than a threshold, stop iterations. The parameters of the ellipses given by the ellipse fitting algorithm, performed in each slot bounded by $(x_0, x_1)$ and $(x_1, x_2)$, will determine the location and size of the people's heads in the foreground region. If the difference between $P(O \mid \lambda_2)$ for consecutive values of parameter $x_1$ is bigger than the same threshold, then go to step 1102.

6. The Iterative Ellipse Fitting Algorithm

In step 402, the head within a video image is detected by iteratively fitting an ellipse around the upper portion of the foreground region inside the area bounded by $x_{j-1}, x_j$ (where $j = 1, \ldots, i$). The objective in an ellipse fitting algorithm is to find the $x_0$, $y_0$, a and b parameters of the ellipse such that:

$$\left(\frac{x - x_0}{a}\right)^2 + \left(\frac{y - y_0}{b}\right)^2 = 1 \tag{5}$$

A general prior art technique for fitting the ellipse around the detected blobs in step 402 (FIG. 4) is the use of the Hough Transform, described by Chellappa et al. in "Human and Machine Recognition of Faces: A Survey", Proceedings of the IEEE, vol. 83, no. 5, pp. 705-740, May 1995. However, the computational complexity of the Hough Transform approach, as well as the need for a robust edge detection algorithm, make it ineffective for real-time applications.

A better alternative for fitting the ellipse in step 402 (FIG. 4) is an inexpensive recursive technique that reduces the search for the ellipse parameters from a four dimensional space $x_0, y_0, a, b$ to a one dimensional space. The parameter space of the ellipse is reduced based on the following observations:

The width of the ellipse at iteration k+1 is equal to the distance between the right most and left most point of the blob at the line corresponding to the current centroid position, $y_0^{(k)}$, i.e.

$$a^{(k+1)} = f_1(y_0^{(k)}) \tag{6}$$

where function $f_1$ is determined by the boundary of the objects resulting from the connected component analysis.

The centroid of the ellipse is located on the so-called "vertical skeleton" of the blob representing the person. The vertical skeleton is computed by taking the middle point between the left-most and the right-most points for each line of the blob. The $x_0^{(k+1)}$ coordinate of the centroid of the ellipse at iteration k+1 is located on the vertical skeleton at the line $y_0^{(k)}$ corresponding to the current centroid position. Hence $x_0^{(k+1)}$ will be uniquely determined as a function of $y_0^{(k)}$:

$$x_0^{(k+1)} = f_2(y_0^{(k)}) \tag{7}$$

where function $f_2$ is a function determined by the vertical blob skeleton.

The b parameter of the ellipse (the length) is generally very difficult to obtain with high accuracy due to the difficulties in finding the chin line. However, generally the length to width ratio of the ellipse can be considered constant, such as M. Then, from Equation (6):

$$b^{(k+1)} = M \cdot a^{(k+1)} = M \cdot f_1(y_0^{(k)}) \tag{8}$$

From Equation (5) we write:

[Equation (9) is illegible in the source text.]

Equations (6), (7), (8) and (9) lead to:

[Equation (10) is illegible in the source text.]

which describes the iterative ellipse-fitting algorithm of the present invention. Equation (10) indicates that we have reduced the four-dimensional problem of finding the ellipse parameters to an implicit equation with one unknown, $y_0$.

With this in mind, the ellipse fitting process is illustrated in further detail in FIG. 5. In step 503 the edges and the vertical skeleton of the foreground regions in the area bordered by $x_{j-1}, x_j$ are extracted. After the extraction of the skeletons of the foreground regions, the $y_0$ parameter of the ellipse is iteratively computed.

In one embodiment, the initial y coordinate of the ellipse centroid, $y_0^{(0)}$, is chosen close enough to the top of the object on the vertical skeleton in order for the algorithm to perform well for all types of sequences from head-and-shoulder to full-body sequence (step 504). Typically the initial value of $y_0^{(0)}$ is selected according to the following expression:

$$y_0^{(0)} = y_t + 0.1 \cdot (y_t - y_b) \tag{11}$$

where $y_t$ is the y coordinate of the highest point of the skeleton and $y_b$ is the y coordinate of the lowest point of the skeleton. Given the initial point $y_0^{(0)}$, the ellipse fitting algorithm iterates through the following loop to estimate the ellipse parameters. The reference numerals in [brackets] refer to the steps illustrated in FIG. 12:

[1201] compute parameter $2a^{(k)}$ by measuring the distance between the left and the right edges of the blob.

[1202] compute parameter $b^{(k)}$ by measuring the y distance between $y_0^{(k)}$ and the highest point of the skeleton.

[1203] compute the error $e^{(k)}$ (in step 505),

$$e^{(k)} = b^{(k)} - M a^{(k)} \tag{12}$$

In sum, the goal of the ellipse fitting algorithm described herein is to minimize this value, i.e. to find the ellipse that best satisfies the condition $b = Ma$, $M \approx 1.4$.
[1204] compute the new value $y_0^{(k+1)}$ (step 506) using a linear estimate given by the following equation:

$$y_0^{(k+1)} = y_0^{(k)} + \mu e^{(k)} \tag{12A}$$

[1205] if the distance between two consecutive centroids is smaller than a threshold, stop the iterations. When the iterations stop, $x_0$, $y_0$, a and b describe the four parameters of the ellipse. Otherwise, go to step 1203.

The above iterations converge to the ellipse parameters for an ellipse type contour. From equation (5), the distance between the right most and left most point of the ellipse corresponding to $y_0^{(k)}$ is determined by:

$$a^{(k)} = a \sqrt{1 - \left(\frac{y_0^{(k)} - y_0}{Ma}\right)^2} \tag{13}$$

and the distance between the top of the ellipse and $y_0^{(k)}$ is determined by

$$b^{(k)} = y_0 + Ma - y_0^{(k)} \tag{14}$$

Hence, for $\mu = 1$, equation (12A) becomes:

$$y_0^{(k+1)} - y_0 = Ma - Ma \sqrt{1 - \left(\frac{y_0^{(k)} - y_0}{Ma}\right)^2} \tag{15}$$

From the above equation it can be proved that

$$|y_0^{(k+1)} - y_0|^2 < |y_0^{(k)} - y_0|^2$$

for any $y_0^{(k)}$ for which $|y_0^{(k)} - y_0| < Ma$. This shows that the recurrence defined in equation (10) converges to $y_0$.

7. Eye Detection Process

The ellipses detected from stage 201, and as described previously, are potentially the region of support for human faces. After the detection of these regions, a more refined model for the face is required in order to determine which of the detected regions correspond to valid faces. The use of the eye detection process of stage 202, in conjunction with the head detection stage 201, improves the accuracy of the head model and removes regions corresponding to back views of faces or other regions that do not correspond to a face. Eye detection results can also be used to estimate the face pose and to determine the image containing the most frontal poses among a sequence of images. This result may then be used in recognition and classification systems.

The present invention may use an eye-detection algorithm based on both region size and geometrical measure filtering. The exclusive use of geometrical measures to detect the eyes inside a rectangular window around the ellipse centroid (eye band: $W_{eye}$ 1001 in FIG. 10) may lead to problems in the analysis of non-frontal faces. In these cases, the hair regions inside the eye band generate small hair regions that are not connected to each other and that are in general close in size and intensity to the eye regions. Under the assumption of varying poses, the simple inspection of geometrical distances between regions and positions inside the eye band cannot indicate which regions correspond to the eyes. Hence, a more difficult approach based on region shape can be taken into account. However, in the present invention, a simple method may be implemented to discriminate eye and hair regions that performs with good results for a large number of video image sequences. In this approach, the small hair regions inside the eye band are removed by analyzing the region sizes in a larger window around the upper portion of the face ($W_{face-up}$ 1002 in FIG. 10). Inside this window, the hair corresponds to the region of large size.

Stage 202 of FIG. 2 illustrates the steps of the eye detection approach that may be used according to the present invention. In step 221, the pixel intensities inside the face regions are compared to a threshold $\theta$, and pixels with intensities lower than $\theta$ are extracted from the face region. In step 222, and as shown in FIG. 7A, the connectivity analysis of the extracted pixels generates connected sets of pixels (e.g., pixels 701), i.e. sets of pixels that are adjacent or touching. Each of these connected sets of pixels 701 describes a low intensity region of the face.

In step 223, the pixel regions 701 resulting from steps 221-222 are filtered with respect to the region size. Regions having a small number of pixels due to camera noise or shadows are removed. Large regions generally cannot represent eyes, but instead correspond in general to hair. The size of the regions selected at this stage is in the interval $[\theta_m, \theta_M]$, where $\theta_m$ is the minimum and $\theta_M$ is the maximum number of pixels allowed by our system to describe a valid eye region. Threshold values $\theta_m$, $\theta_M$ are determined based on the size of the ellipse that characterizes the head region (the ellipse being generated iteratively in step 215). The end result of step 223 is an image 702, such as that shown in FIG. 7B.

In step 224, the remaining components within the image of FIG. 7B are filtered based on anthropological measures, such as the geometrical distances between eyes and the expected position of the eyes inside a rectangular window (eye band) centered in the ellipse centroid. The eye regions are determined by analyzing the minimum and maximum distance between the regions inside this band. The output 230 of step 224 is an image, such as shown in FIG. 7C, whereby the eyes 703 have been detected.

The present invention may be implemented on a variety of different video sequences from camera 110. FIGS. 8A, 8B, 8C and 8D depict the results obtained by operating the present invention in a sample laboratory environment, based upon the teachings above. FIGS. 8A-8D comprise four different scenarios generated to demonstrate the performance under different conditions such as non-frontal poses, multiple occluding people, back views, and faces with glasses. In FIG. 8A, the face 812 of a single person 811 is detected, via ellipse 813. In this figure, the ellipse 813 is properly fitted around the face 812 and the eyes 814 are detected even though the person 811 is wearing optical glasses on his face 812.

FIG. 8B shows the back view of a single person 821 in the video scene. In this figure, the ellipse 823 is fitted around the head of the person 821, but no eye is detected, indicating the robustness of the eye detection stage 202 of the present invention.

FIGS. 8C and 8D show two scenarios in which two people 831A and 831B are present in the scene. In both figures the body of one person 831B is covering part of the body of the other person 831A. In both cases, ellipses 833A and 833B are positioned around the faces 832A and 832B, and eyes 834A and 834B are detected. In FIG. 8D, the face 832A of the person 831A in the back has a non-frontal position. Also, due to different distances from the camera 110, the sizes of the two faces 832A and 832B are different. The faces 832A and 832B of both persons 831A and 831B are detected, indicating the robustness of the system to variations in parameters such as size and position of the faces 832A and 832B.

Although the present invention has been described with particular reference to certain preferred embodiments
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thereof, variations and modifications of the present inven- 
tion can be effected within the spirit and scope of the 
following claims. 
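
As an illustration of the eye detection process of Section 7, the size and geometrical filtering may be sketched as follows. This is a minimal sketch under stated assumptions and not the patented implementation: the function names, the 4-connectivity, the intensity threshold, the size limits and the distance limits are all hypothetical values chosen for the example.

```python
from collections import deque

def dark_regions(image, low_threshold):
    """Group pixels darker than low_threshold into 4-connected regions."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or image[r][c] >= low_threshold:
                continue
            queue, pixels = deque([(r, c)]), []
            seen[r][c] = True
            while queue:  # breadth-first flood fill of one region
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and image[ny][nx] < low_threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(pixels)
    return regions

def filter_by_size(regions, min_size, max_size):
    """Drop regions that are too small (noise) or too large (hair)."""
    return [p for p in regions if min_size <= len(p) <= max_size]

def centroid(pixels):
    n = len(pixels)
    return (sum(y for y, _ in pixels) / n, sum(x for _, x in pixels) / n)

def plausible_eye_pair(c1, c2, min_dist, max_dist):
    """Assumed geometrical check: horizontal spacing within given limits."""
    return min_dist <= abs(c1[1] - c2[1]) <= max_dist

# Synthetic eye band: bright skin (200) with dark (10) regions.
face = [[200] * 10 for _ in range(7)]
for y in (2, 3):
    for x in (2, 3, 6, 7):
        face[y][x] = 10          # two 2x2 "eye" regions
face[0][9] = 10                  # single-pixel noise
face[6] = [10] * 10              # large dark "hair" region

eyes = filter_by_size(dark_regions(face, low_threshold=50), 2, 6)
centers = [centroid(p) for p in eyes]
pair_ok = plausible_eye_pair(centers[0], centers[1], min_dist=3, max_dist=6)
```

In this example the single-pixel region and the ten-pixel region are rejected by size, and the two remaining regions are accepted as an eye pair because their horizontal spacing lies inside the assumed limits.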
What is claimed is: 

1. A system for detecting a face within a video image, 
comprising: 

(a) a video camera; 

(b) means for storing an image from the video camera; 
and 

(c) processing means coupled to the video camera and the 
storing means for performing the steps of: 

(i) storing a background image from the video camera 
in the storing means; 

(ii) storing a video image from the video camera in the 
storing means; 

(iii) subtracting the background image from video 
image stored in the storing means; 

(iv) identifying a region within the video image that 
surpasses a selected high intensity threshold; 

(v) comparing the identified region to at least one 
model of a face; 

(vi) selecting one of the at least one model that best 
describes the identified region; 

(vii) generating parameters associated with an ellipse 
that corresponds to the identified region, responsive 
to step (vi); 

(viii) identifying sub-regions within the identified 
region that are below a selected low-intensity thresh- 
old; 

(ix) filtering out sub-regions below a selected small size
or above a selected large size;

(x) comparing the remaining sub-regions to at least one
anthropological model defining eyes; and

(xi) generating parameters corresponding to the 
remaining sub-regions, responsive to step (x). 

2. The system of claim 1, further comprising a second 
processing means for receiving the generated parameters 
and for performing further processing of the stored video 
image based upon the generated parameters. 

3. The system of claim 1, further comprising a second 
processing means for receiving the generated parameters 
and for performing further processing of the stored video 
image based upon the generated parameters. 

4. A process for detecting a face within a video image, 
wherein the video image is generated by a video camera and 
stored with a storage device, comprising the steps of: 

(a) storing a background image from the video camera in 
the storage device; 

(b) storing a video image from the video camera in the 
storage device; 

(c) subtracting the background image from video image 
stored in the storage device; 

(d) identifying a region within the video image that 
surpasses a selected high intensity threshold; 

(e) comparing the identified region to at least one model 
of a face; 

(f) selecting one of the at least one model that best 
describes the identified region; and 

(g) generating parameters associated with an ellipse that
corresponds to the identified region, responsive to step
(f);

(h) identifying sub-regions within the identified region 
that are below a selected low-intensity threshold; 

(i) filtering out sub-regions below a selected small size or 
above a selected large size; 
(j) comparing the remaining sub-regions to at least one
anthropological model defining eyes; and

(k) generating parameters corresponding to the remaining
sub-regions, responsive to step (j).

5. The process of claim 4, further comprising the step of 
performing further processing of the stored video image 
based upon the generated parameters. 

6. The process of claim 4, further comprising the step of
performing further processing of the stored video image
based upon the generated parameters.

7. The system of claim 1, wherein the processing means
further performs the steps of:

(1) computing the vertical skeleton of the video image;

(2) estimating an initial ellipse centroid from the highest
and lowest point of the vertical skeleton;

(3) measuring the width between the left and the right
edges of the video image;

(4) measuring the length between the highest point of the
vertical skeleton and the y coordinate of the ellipse
centroid at the current iteration;

(5) computing the error $e^{(k)}$ associated with the currently
determined ellipse parameters according to the expression:

wherein $b^{(k)}$ is the distance between the highest point of
the vertical skeleton and the y coordinate of the ellipse
centroid at the $k$-th iteration, $a^{(k)}$ is the ellipse width at
the current $k$-th iteration, and M is the desired ratio of
ellipse length to width.

(6) computing a new centroid value according to the error
associated with the ellipse parameters; and

(7) repeating steps (1)-(6) until the distance between the
new centroid and the centroid of the previous iteration
is smaller than a selected threshold.

8. The process of claim 4, further comprising the steps of:

(1) computing the vertical skeleton of the video image;

(2) estimating an initial ellipse centroid from the highest
and lowest point of the vertical skeleton;

(3) measuring the width between the left and the right
edges of the video image;

(4) measuring the length between the highest point of the
vertical skeleton and the y coordinate of the ellipse
centroid at the current iteration;

(5) computing the error $e^{(k)}$ associated with the currently
determined ellipse parameters according to the expression:

wherein $b^{(k)}$ is the distance between the highest point of
the vertical skeleton and the y coordinate of the ellipse
centroid at the $k$-th iteration, $a^{(k)}$ is the ellipse width at
the current $k$-th iteration, and M is the desired ratio of
ellipse length to width;

(6) computing a new centroid value according to the error
associated with the ellipse parameters; and

(7) repeating steps (1)-(6) until the distance between the
new centroid and the centroid of the previous iteration
is smaller than a selected threshold.
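
The iterative centroid estimation recited in steps (1)-(7) can be illustrated with the following minimal sketch. It rests on several assumptions that are not taken from the text: the error expression is assumed to be $e^{(k)} = b^{(k)} - M a^{(k)}$, the update is assumed to be a damped step against that error, the topmost and bottommost foreground rows stand in for the ends of the vertical skeleton, the foreground is assumed to occupy every row between them, and the function name, step size and tolerance are hypothetical.

```python
def fit_centroid_y(mask, ratio_m, step=0.5, tol=0.01, max_iter=100):
    """Iteratively estimate the y coordinate of the ellipse centroid.

    mask is a binary silhouette (list of rows of 0/1), row 0 at the top.
    ratio_m is the desired ratio of ellipse length to width (M).
    """
    occupied = [r for r, row in enumerate(mask) if any(row)]
    top, bottom = occupied[0], occupied[-1]
    y0 = (top + bottom) / 2.0                  # step (2): initial centroid
    half_width = 0.0
    for _ in range(max_iter):
        r = min(max(int(round(y0)), top), bottom)
        cols = [c for c, v in enumerate(mask[r]) if v]
        half_width = (cols[-1] - cols[0] + 1) / 2.0   # step (3): a
        length = y0 - top                             # step (4): b
        error = length - ratio_m * half_width         # step (5): ASSUMED form
        y_new = y0 - step * error                     # step (6): ASSUMED update
        if abs(y_new - y0) < tol:                     # step (7): stop test
            return y_new, half_width
        y0 = y_new
    return y0, half_width

# Synthetic silhouette: a head 8 pixels wide above a neck 2 pixels wide.
head = [[0] * 2 + [1] * 8 + [0] * 2 for _ in range(12)]
neck = [[0] * 5 + [1] * 2 + [0] * 5 for _ in range(4)]
mask = head + neck

y_c, a = fit_centroid_y(mask, ratio_m=1.5)
```

With a half-width of 4 and M = 1.5, the assumed recurrence settles where the distance from the top of the silhouette to the centroid equals M times the half-width, that is, near row 6, so the neck rows no longer bias the initial estimate of 7.5.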

***** 